A large share of Americans already live with heart disease. The map below shows the prevalence of heart disease across the US.
About 90% of the risk for a first heart attack is attributable to conditions or lifestyle factors that are preventable.
Lifestyle risk factors
There is plenty of room for improvement in raising awareness of cardiovascular risk: a recent study found that one in five adults at risk for heart disease do not recognize a need to improve their health. We believe that a better understanding of the risk factors underlying health perceptions and behaviors is needed to make the most of cardiovascular prevention efforts.
We sought to examine the risk factors for heart disease and their prevalence in New York City. We were also interested in novel ways of visualizing the correlation of individual risk factors with heart disease and stroke, building on what we learned in the Data Science class. In addition, we wanted to find ways of predicting and visualizing an individual’s risk for heart disease. As such, our questions were as follows:
To address the first two questions, we used the 500 Cities: Local Data for Better Health dataset. The 500 Cities Project is a collaboration between the CDC, the Robert Wood Johnson Foundation, and the CDC Foundation. Its purpose is to provide city- and census tract-level small area estimates for chronic disease risk factors, health outcomes, and clinical preventive service use for the 500 largest cities in the United States. These small area estimates allow cities and local health departments to better understand the burden and geographic distribution of health-related variables in their jurisdictions, and assist them in planning public health interventions. Since we were interested in visualizing these data only for New York City, we filtered the dataset accordingly.
Risk factors that we’re interested in (15 total):
Outcomes that we’re interested in (2 total):
To predict an individual’s risk for heart disease, we used the Framingham Risk Score, a gender-specific algorithm that estimates the 10-year cardiovascular risk of an individual. It was first developed from data obtained in the Framingham Heart Study to estimate the 10-year risk of developing coronary heart disease. We used the algorithm to build a visualization in a Shiny app, as discussed below.
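Framingham-type equations are generally published as Cox proportional-hazards models, in which the 10-year risk equals 1 - S0^exp(lp - mean lp). The sketch below illustrates only that structure. The function name `cox_10yr_risk`, its arguments, and the toy numbers are hypothetical; the actual coefficients, baseline survival, and mean linear predictor come from the published sex-specific tables and are not reproduced here.

```r
# Generic Cox-type 10-year risk: risk = 1 - S0 ^ exp(lp - mean_lp),
# where lp = sum(beta * x). All inputs are supplied by the caller;
# no published Framingham coefficients are hard-coded here.
cox_10yr_risk = function(x, beta, s0, mean_lp) {
  stopifnot(length(x) == length(beta), s0 > 0, s0 < 1)
  lp = sum(beta * x)            # linear predictor for this individual
  1 - s0 ^ exp(lp - mean_lp)    # estimated 10-year event probability
}

# Toy illustration with made-up numbers (NOT real Framingham values)
toy_beta = c(log_age = 2, smoker = 0.5)
toy_mean = sum(toy_beta * c(log(50), 0))  # linear predictor of the reference profile

risk_nonsmoker = cox_10yr_risk(c(log(50), 0), toy_beta, s0 = 0.9, mean_lp = toy_mean)
risk_smoker    = cox_10yr_risk(c(log(50), 1), toy_beta, s0 = 0.9, mean_lp = toy_mean)
```

In this toy setup the reference profile has a risk of exactly 1 - s0, and switching the smoking indicator from 0 to 1 raises the estimate, which is the kind of "what if" comparison an interactive risk calculator makes visible.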
We downloaded the data directly from the CDC open data portal, as seen in the code below, and called it cvrisk.
library(tidyverse)

cvrisk_url = "https://data.cdc.gov/api/views/6vp6-wxuq/rows.csv?accessType=DOWNLOAD"
## Read in data from the CDC open data portal
cvrisk =
  read.csv(url(cvrisk_url)) %>%
  janitor::clean_names() %>%
  as_tibble()
We then restricted the dataset to census tracts in New York City and kept only the variables needed for the visualizations. We called this dataset nyc_cvrisk.
nyc_cvrisk =
  cvrisk %>%
  filter(state_desc == "New York", city_name == "New York",
         geographic_level == "Census Tract",
         !is.na(data_value),
         year == 2016,
         measure_id %in% c("ACCESS2", "BINGE", "BPHIGH", "BPMED", "OBESITY", "CHECKUP",
                           "CHOLSCREEN", "CSMOKING", "DIABETES", "HIGHCHOL", "KIDNEY",
                           "LPA", "MHLTH", "PHLTH", "SLEEP", "CHD", "STROKE")) %>%
  droplevels() %>%
  select(unique_id, population_count, measure_id, data_value, short_question_text)
For the animated scatterplots, we then created datasets in which each outcome measure (coronary heart disease or stroke) appears as a separate variable alongside the risk factors.
## Risk factors only (long format, one row per tract and risk factor)
nyc_cvrisk_limited = nyc_cvrisk %>%
  filter(short_question_text %in% c("Annual Checkup", "Binge Drinking", "Cholesterol Screening",
                                    "Chronic Kidney Disease", "Current Smoking", "Diabetes",
                                    "Health Insurance", "High Blood Pressure", "High Cholesterol",
                                    "Mental Health", "Obesity", "Physical Health",
                                    "Physical Inactivity", "Sleep <7 hours", "Taking BP Medication")) %>%
  droplevels() %>%
  mutate(risk_factor = short_question_text) %>%
  select(-short_question_text, -measure_id)
## Outcome: coronary heart disease (one row per tract)
nyc_cad = nyc_cvrisk %>%
  filter(short_question_text == "Coronary Heart Disease") %>%
  droplevels() %>%
  select(-measure_id, -short_question_text, -population_count)
## Outcome: stroke (one row per tract)
nyc_stroke = nyc_cvrisk %>%
  filter(short_question_text == "Stroke") %>%
  droplevels() %>%
  select(-measure_id, -short_question_text, -population_count)
## After the join, data_value.x is the risk factor and data_value.y is the outcome
nyc_cvrisk_joined = left_join(nyc_cvrisk_limited, nyc_cad, by = "unique_id")
nyc_cvrisk_joined_stroke = left_join(nyc_cvrisk_limited, nyc_stroke, by = "unique_id")
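Because both tables in each join carry a column named data_value, dplyr disambiguates them with the suffixes .x (left table, the risk factor) and .y (right table, the outcome). A minimal sketch on made-up data shows where those names come from; `risk_toy`, `cad_toy`, and `joined_toy` are hypothetical objects, not part of the project.

```r
library(dplyr)
library(tibble)

# Toy long-format data: two tracts, two risk factors each (made-up values)
risk_toy = tibble(
  unique_id   = c("A", "A", "B", "B"),
  risk_factor = c("Obesity", "Diabetes", "Obesity", "Diabetes"),
  data_value  = c(30, 12, 20, 8)
)

# Toy outcome data: one row per tract
cad_toy = tibble(unique_id = c("A", "B"), data_value = c(6.5, 4.1))

# Both tables have a `data_value` column, so dplyr appends .x and .y
joined_toy = left_join(risk_toy, cad_toy, by = "unique_id")
names(joined_toy)
```

If more descriptive names are preferred, `left_join()` also accepts a `suffix` argument, for example `suffix = c("_risk", "_chd")`.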
We then installed the package required for animation:
devtools::install_github('thomasp85/gganimate')
We then created an animated scatterplot with coronary heart disease prevalence on the x-axis and the prevalence of each risk factor on the y-axis:
library(ggplot2)
library(gganimate)
theme_set(theme_bw()) # pre-set the bw theme.
## x = outcome prevalence (data_value.y), y = risk factor prevalence (data_value.x)
p = nyc_cvrisk_joined %>%
  ggplot(aes(x = data_value.y, y = data_value.x)) +
  geom_point(aes(size = population_count, colour = risk_factor),
             alpha = 0.5) +
  xlim(0, 10) +
  labs(title = "{closest_state}",
       x = 'Coronary Heart Disease Prevalence',
       y = 'Risk Factor Prevalence',
       colour = 'Risk Factors',
       size = 'Population Count') +
  theme(plot.title = element_text(size = 40, face = "bold"),
        axis.text = element_text(size = 18),
        axis.title = element_text(size = 18, face = "bold")) +
  theme(legend.text = element_text(size = 16),
        legend.title = element_text(size = 18, face = "bold")) +
  # gganimate parts: one state per risk factor, using transition_length and state_length
  transition_states(risk_factor, transition_length = 1, state_length = 3, wrap = TRUE) +
  enter_fade() +
  exit_fade()
animate(p, fps = 2, height = 600, width = 1000, renderer = gifski_renderer())
We created a similar animation for stroke as shown below:
theme_set(theme_bw()) # pre-set the bw theme.
## x = stroke prevalence (data_value.y), y = risk factor prevalence (data_value.x)
pp = nyc_cvrisk_joined_stroke %>%
  ggplot(aes(x = data_value.y, y = data_value.x)) +
  geom_point(aes(size = population_count, colour = risk_factor),
             alpha = 0.5) +
  xlim(0, 10) +
  labs(title = "{closest_state}",
       x = 'Stroke Prevalence',
       y = 'Risk Factor Prevalence',
       colour = 'Risk Factors',
       size = 'Population Count') +
  theme(plot.title = element_text(size = 40, face = "bold"),
        axis.text = element_text(size = 18),
        axis.title = element_text(size = 18, face = "bold")) +
  theme(legend.text = element_text(size = 16),
        legend.title = element_text(size = 18, face = "bold")) +
  # gganimate parts: one state per risk factor
  transition_states(risk_factor, transition_length = 1, state_length = 3, wrap = TRUE) +
  enter_fade() +
  exit_fade()
animate(pp, fps = 2, height = 600, width = 1000, renderer = gifski_renderer())
To make the correlation plots, we first reshaped our dataset into wide format, with one column per measure, and called it nyc_cvrisk_wide.
nyc_cvrisk_wide =
  nyc_cvrisk %>%
  select(-measure_id) %>%
  spread(key = short_question_text, value = data_value) %>%
  janitor::clean_names()
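The long-to-wide step can be seen on a small made-up example. This sketch uses `pivot_wider()`, the newer tidyr equivalent of `spread()`; the object names `long_toy` and `wide_toy` and the values are hypothetical.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy long-format data: one row per tract and measure (made-up values)
long_toy = tibble(
  unique_id           = c("A", "A", "B", "B"),
  short_question_text = c("Obesity", "Stroke", "Obesity", "Stroke"),
  data_value          = c(30, 3.1, 20, 2.4)
)

# One row per tract, one column per measure
wide_toy = long_toy %>%
  pivot_wider(names_from = short_question_text, values_from = data_value)
wide_toy
```

Each tract now occupies a single row, which is the layout `cor()` expects when computing pairwise correlations between measures.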
The correlation plot for coronary heart disease and risk factors of interest is as follows:
library(corrplot)
nyc_cvrisk_wide %>%
  select(annual_checkup:sleep_7_hours) %>%
  select("coronary_heart_disease", everything()) %>%
  rename("Coronary Disease" = coronary_heart_disease,
         "Annual Checkup" = annual_checkup,
         "Binge Drinking" = binge_drinking,
         "Kidney Disease" = chronic_kidney_disease,
         "Current Smoking" = current_smoking,
         "Diabetes" = diabetes,
         "No Insurance" = health_insurance,
         "Poor Mental Health" = mental_health,
         "Obesity" = obesity,
         "Poor Physical Health" = physical_health,
         "Physical Inactivity" = physical_inactivity,
         "Poor Sleep" = sleep_7_hours) %>%
  cor() %>%
  corrplot(method = "square", order = "AOE", addCoef.col = "black",
           tl.col = "black", tl.srt = 45, insig = "blank",
           # hide correlation coefficient on the principal diagonal
           diag = FALSE)
The correlation plot for stroke and risk factors of interest is as follows:
nyc_cvrisk_wide %>%
  select(annual_checkup:stroke) %>%
  select(stroke, everything()) %>%
  select(-c(coronary_heart_disease)) %>%
  rename("Stroke" = stroke,
         "Annual Checkup" = annual_checkup,
         "Binge Drinking" = binge_drinking,
         "Kidney Disease" = chronic_kidney_disease,
         "Current Smoking" = current_smoking,
         "Diabetes" = diabetes,
         "No Insurance" = health_insurance,
         "Poor Mental Health" = mental_health,
         "Obesity" = obesity,
         "Poor Physical Health" = physical_health,
         "Physical Inactivity" = physical_inactivity,
         "Poor Sleep" = sleep_7_hours) %>%
  cor() %>%
  corrplot(method = "square", order = "AOE", addCoef.col = "black",
           tl.col = "black", tl.srt = 45, insig = "blank",
           # hide correlation coefficient on the principal diagonal
           diag = FALSE)
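To read the correlations with a single outcome without scanning the whole plot, the outcome's row of the correlation matrix can be extracted and sorted. The sketch below does this on a made-up data frame; `toy`, `cor_mat`, and `outcome_cor` are hypothetical names, and the values are not taken from the 500 Cities data.

```r
# Toy tract-level prevalences (made-up values)
toy = data.frame(
  stroke         = c(2.0, 3.0, 4.0, 5.0, 6.0),
  diabetes       = c(8.0, 10.0, 11.0, 14.0, 15.0),
  binge_drinking = c(22.0, 20.0, 19.0, 15.0, 14.0)
)

# Pairwise Pearson correlations between all columns
cor_mat = cor(toy)

# Correlation of each risk factor with the outcome, strongest first
outcome_cor = cor_mat["stroke", colnames(cor_mat) != "stroke"]
outcome_cor[order(abs(outcome_cor), decreasing = TRUE)]
```

Sorting by absolute value keeps strong negative correlations near the top of the list, next to strong positive ones.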
We made several salient observations about correlation of risk factors with Coronary Artery Disease and Stroke.
We used the shiny and leaflet packages to visualize the geographic distribution of several risk factors and outcomes in New York City. We observed that most of the risk factors were more prevalent in Harlem and the Bronx. Unhealthy behaviors like binge drinking, however, were more prevalent in downtown Manhattan.
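A map of this kind can be sketched with leaflet alone. The example below is only an illustration under assumptions: it plots made-up points rather than census tract geometries, and the data frame `toy_tracts`, its coordinates, and its prevalence values are hypothetical.

```r
library(leaflet)

# Toy points with made-up coordinates and prevalences (not real tract data)
toy_tracts = data.frame(
  lat   = c(40.81, 40.85, 40.72),
  lng   = c(-73.95, -73.90, -74.00),
  value = c(12.5, 14.0, 6.0)
)

# Continuous colour scale mapped to the prevalence values
pal = colorNumeric(palette = "YlOrRd", domain = toy_tracts$value)

m = leaflet(toy_tracts) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~lng, lat = ~lat, color = ~pal(value),
                   radius = 8, stroke = FALSE, fillOpacity = 0.8,
                   popup = ~paste0("Prevalence: ", value, "%")) %>%
  addLegend(pal = pal, values = ~value, title = "Prevalence (%)")
m
```

Inside a Shiny app, the same map object would typically be returned from `renderLeaflet()` and displayed with `leafletOutput()`, with the plotted measure chosen through an input control.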
We also used the shiny and plotly packages to build an app that calculates an individual's risk of heart disease using the Framingham Risk Score. The app takes an individual's risk factor values as input and predicts the 10-year cardiovascular risk using the Framingham Risk Score algorithm.
In the process of building this project, we gained several insights about risk factors for heart disease.
In particular, the strength of the correlation between lifestyle risk factors and both coronary artery disease and stroke indicates the potential for making an impact by targeting these risk factors with interventions. Interestingly, alcohol intake was negatively correlated with both coronary disease and stroke, a pattern that has also been observed in several meta-analyses. This is, however, likely only an association and not evidence of causation, as discussed in this article.
Visualizing the geographic variation of these risk factors and outcomes in New York City provides insight into which neighborhoods to target when implementing cardiovascular risk reduction policies.
Visualizing an individual's risk for heart disease from patient-level inputs, and seeing how modifying one input changes that risk, can be very powerful for people working toward decreasing their risk.